Data tidying is a necessary first step for data analysis - it's the process of taking your messily formatted data (missing values, unwieldy coding/organization, etc.) and literally tidying it up so it can be easily used for downstream analyses. To quote Hadley Wickham, "Tidy datasets are easy to manipulate, model and visualise, and have a specific structure: each variable is a column, each observation is a row, and each type of observational unit is a table."
These data are actually pretty tidy, so we're going to be focusing on cleaning and manipulation, but these manipulations will give you some idea of how to tidy untidy data.
We are going to be using the data from the R package nycflights13
. There are five datasets corresponding to flights departing NYC in 2013. We will load directly into R from the library, but the repository also includes CSV files we created for the purposes of the Python demo and can also be used to load the data into our R session.
*** If you've never run Jupyter notebooks with R, please run conda install -c r r-essentials
In [ ]:
options(repos=structure(c(CRAN="http://cran.cnr.berkeley.edu/",
BioCsoft="http://www.bioconductor.org/packages/release/bioc/")))
ipak <- function(pkg){
new.pkg <- pkg[!(pkg %in% installed.packages()[, "Package"])]
if (length(new.pkg))
install.packages(new.pkg, dependencies = TRUE)
sapply(pkg, require, character.only = TRUE)
} #https://gist.github.com/stevenworthington/3178163
pkgs <- c("nycflights13",
"tidyr",
"dplyr",
"reshape",
"ggplot2",
"data.table")
ipak(pkgs)
In [ ]:
#invisible(sapply(pkgs, library, character.only=TRUE ))
In [ ]:
unzip("nycflights13.zip")
list.files()
read.delim("airlines.txt")
In [ ]:
data(flights)
flights <- data.frame(flights) ## dplyr has introduced a new data format that I am ignoring
dim(flights)
What are the first 6 rows?
In [ ]:
head(flights)
What are the last 6 rows?
In [ ]:
tail(flights)
What does the sample function do?
In [ ]:
sample(1:6,2)
What happens when I use sample for indexing?
In [ ]:
flights[sample(1:nrow(flights),10),]
In [ ]:
print(any(is.na(flights)))
print(all(is.na(flights)))
Selecting for flights where there is complete data, what are the dimensions?
complete.cases
returns a logical vector indicating whether all observations in a row are not-NA.
In [ ]:
message('Using base R...')
flights_complete <- flights[complete.cases(flights),]
print(dim(flights_complete))
message('Using tidyR...')
flights_complete2 <- drop_na(flights)
Are they the same datasets?
In [ ]:
identical(flights_complete,flights_complete2)
How might I obtain a summary of the original dataset?
In [ ]:
summary(flights)
R allows easy application of descriptive function along an axis.
any
and all
, which we used earlier, is an example of that. If the data are boolean, any
collapses a series of boolean values into True if any of the values are true. all
collapses a series of boolean values into True if all of the values are true.
What's the mean air time?
In [ ]:
mean(flights_complete$air_time)
Can we calculate the mean for multiple subsets of our data at once?
In [ ]:
timedat <- select(flights_complete, air_time, dep_delay, arr_delay)
Find mean of each row...
In [ ]:
head(apply(timedat,1, mean))
Find mean of each column...
In [ ]:
apply(timedat,2, mean)
Find mean of each column...yet another way...
In [ ]:
lapply(timedat, mean)
Let's fix the formatting...
In [ ]:
sapply(timedat, mean)
Using dplyr
...
In [ ]:
timedat %>% summarise_all(mean) %>% print()
Sometimes you may want to perform some aggregate function on data by category, which is encoded in another column. Here we calculate the statistics for departure delay, grouping by origin of the flight - remember this is the greater NYC area, so there are only three origins!
First let's use tapply
:
In [ ]:
tapply(flights_complete$dep_delay, flights_complete$origin, summary)
That code is a bit messy, so using the with command for indicating the parent data frame...
In [ ]:
with(flights_complete,
tapply(dep_delay, origin, summary)
)
Using dplyr
...
In [ ]:
flights_complete %>% group_by(origin) %>% summarise(avg_dep_delay=mean(dep_delay)) %>% print()
The last sections have used the operator %>%
. This symbol is called a pipe. It was introduced in the magrittr
package, but dplyr
and tidyr
also take advantage of this syntax.
Pipes %>%
exist because they help tidy up commands when performing a chain of operations. When we want to provide to a function1
some data which is output of a function2
, whose input is output from function3
, we can end up with some very nested, difficult-to-read commands:
function1(function2(function3(data,parameters3),parameters2),parameters1)
Sometimes, with
may help simplify your commands, as above, but piping can be more direct.
data %>% function(parameters)
is exactly the same as function(data,parameters)
But the code is read (and written) from the left to the right (and not inside-out) and is easier to understand.
Let's remove rows with missing values (NA) from flights
dataset, then calculate the average departure delay in one call, first using basic syntax then using pipes.
Using standard R syntax:
In [ ]:
with(flights[complete.cases(flights),],
tapply(dep_delay, origin, summary)
)
Same operation using pipes and dplyr
:
In [ ]:
flights %>% filter(complete.cases(.)) %>%
group_by(origin) %>%
summarise(avg_dep_delay = mean(dep_delay)) %>%
print()
flights %>% drop_na() %>%
group_by(origin) %>%
summarise(avg_dep_delay = mean(dep_delay)) %>%
print()
Notes:
.
stands for the data frame in the call of type: data %>% function1(function2(data))
.%>%
should be in the end of a line.You will likely need to combine datasets at some point. R provides quite a few tools to do that, and as you've seen, it's possible to do something many different ways.
Here, we present a simple case of 'vertical' merging, using rbind
. Let's create a data frame with information on flights by United Airlines and American Airlines only, by creating two data frames via subsetting data about each airline one by one and then merging.
The main requirement is that the columns must have the same names (may be in different order).
Subsetting the dataset to make 2 data frames:
In [ ]:
flightsUA <- flights[flights$carrier == 'UA',]
flightsAA <- flights[flights$carrier == 'AA',]
Checking the number of rows in two data frames:
In [ ]:
print(nrow(flightsUA) + nrow(flightsAA))
Combining two data frames then checking the number of rows in the resulting data frame:
In [ ]:
flightsUAandAA <- rbind(flightsUA,flightsAA)
print(nrow(flightsUAandAA))
Nothing special, just be sure the data frames have the columns with the same names and types.
A useful tip is to use do.call
in order to merge more than two data frames.
do.call
is a function that applies a function to a list of elements.
rbinding
3 data frames and checking the number of rows:
In [ ]:
print(nrow(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA))))
'rbind' is really useful for populating a data frames, but it can be a bit slow within loops. Each time we append a row to a data frame within a loop a new copy of a data frame is stored in memory :(
The solution is to create a list of lists and then merge them with do.call rbind
combo. But since rbind
, as many native R functions, is slow and not memory-efficient, for large datasets one may want to use
rbindlist
function from data.table
package, which does the same operation, but faster.
Let's compare these approaches using the system.time
function to see the execution times.
In [ ]:
message('execution time for rbind')
system.time(do.call(rbind, list(flightsUA,flightsAA,flightsUAandAA)))
message('execution time for rbindlist, written in C')
system.time(rbindlist(list(flightsUA,flightsAA,flightsUAandAA)))
And now the example of using rbindlist for populating a data frame.
Let's pretend we forgot how to use tapply
or group_by
(dplyr
) and we want to calculate the average arrival and departure delays per airline.
In [ ]:
Start <- Sys.time()
carriers <- unique(flights_complete$carrier)
resList <- list()
for (i in 1:length(carriers))
{
meanDepDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$dep_delay)
meanArrDelay <- mean(flights_complete[flights_complete$carrier == carriers[i],]$arr_delay)
resList[[i]] <- list(carriers[i],meanDepDelay,meanArrDelay)
}
DelaysByAirline <- rbindlist(resList)
colnames(DelaysByAirline) <- c("carrier","meanDepDelay","meanArrDelay")
End <- Sys.time()
message('It took us')
print(End-Start)
message('and here is the result for Amercian Airlines')
print(DelaysByAirline[DelaysByAirline$carrier == 'AA',])
message('Same result without messing with loops')
Start <- Sys.time()
flights_complete %>% group_by(carrier)%>%
summarize(meanDepDelay = mean(dep_delay), meanArrDelay = mean(arr_delay))%>%
filter(carrier == 'AA') %>% print()
End <- Sys.time()
message('And it took us')
print(End-Start)
In most cases loops are possible to avoid, but if you have to write one, the "list of lists" + rbindlist
approach may save you a lot of time.
In [ ]:
print(head(airports))
The airports
table gives us a key! Let's merge the flights
data with the airports
data, using dest
in flights
and faa
in airports
.
This is pretty easy in base R...
In [ ]:
flights_readdest <- merge(flights_complete, airports, by.x='dest', by.y = 'faa', all.x=TRUE)
head(flights_readdest)
And you can do it in dplyr
too...
In [ ]:
flights_readdest2 <- left_join(flights_complete, airports, by = c("dest" = "faa"))
head(flights_readdest2)
Why did we use all.x = TRUE
and left_join
?
In [ ]:
setdiff(flights$dest, airports$faa)
Well this merged dataset is nice, but do we really need all of this information?
In [ ]:
colnames(flights_readdest)
In [ ]:
flights_sm <- select(flights_readdest,origin, dest=name, year, month, day, air_time)
head(flights_sm)
Why would you want to ever use select
? dplyr
lets you chain operations using the pipes, as discussed above. Let's calculate the average air time for various flight paths, using origin and the readable version of destination airport.
In [ ]:
airtime <- left_join(flights_complete, airports, by = c("dest" = "faa")) %>%
select(origin, dest=name, air_time) %>%
group_by(origin, dest) %>%
summarize(avg_air_time = mean(air_time))
print(head(airtime))
print(dim(airtime))
What's the longest flight from each airport, on average?
In [ ]:
with(airtime, by(avg_air_time, origin, max))
with(airtime, tapply(avg_air_time, origin, max))
airtime %>% group_by(origin) %>% summarise(max(avg_air_time))
Let's find the flight info for the max airtimes from each airport.
In [ ]:
with(airtime, tapply(avg_air_time, origin, function(x) which.max(x)))
message('That\'s the index in the grouped version, but not very useful')
setDT(airtime)[,.SD[which.max(avg_air_time)],by=origin] #data.table package, http://stackoverflow.com/questions/30294088#comment48685310_30294186
airtime %>% group_by(origin) %>% filter(avg_air_time == max(avg_air_time))
In [ ]:
spread(airtime, origin, avg_air_time)
Notice the NA in the last dest row. Can you guess what happened? (This was human input error; we skipped a step in our analysis.)
In [ ]:
head(weather)
intersect(colnames(flights_complete), colnames(weather)) # What columns do they share?
In [ ]:
flights_weather <- merge(flights_complete, weather,
by=c("year", "month","day","hour", "origin"))
print(dim(flights_complete))
print(dim(flights_weather))
What if you only want to look at long delays?
In [ ]:
flights_weather_posdelays <- filter(flights_weather, dep_delay>200)
print(dim(flights_weather_posdelays))
In [ ]:
flights_weather %>% arrange(desc(dep_delay)) %>% head(10)
flights_weather %>% arrange(dep_delay) %>% tail(10)
In [ ]:
flights_weather %>% arrange(dep_delay) %>% head(10)
In [ ]:
print(head(tolower(flights_complete$dest)))
print(head(toupper(airports$name)))
Our current flights_complete
data frame is wide-formatted data. Each variable (measurement) for an observation (flight) is a separate column in that row.
For some plotting functions, including ggplot
in R, it's easier to plot data for multiple series on the same plot axes if the values are all in one column, and the series label is another. To do that, we need to format our data in the long format. In this case, we're going to plot both arrival and departure delays on the same plot, so we need to create a column for "delay time" (here called value
because of the melt
function) and another column type_of_delay
to encode the series.
In [ ]:
print(head(flights_complete))
Using reshape
...
In [ ]:
day_delay <- melt(flights_complete, id.vars=c("time_hour"),
measure.vars=c("dep_delay","arr_delay"), variable_name = "type_of_delay")
print(head(day_delay))
Using tidyr
...
In [ ]:
day_delay <- gather(flights_complete, `dep_delay`,`arr_delay`,
key = "type_of_delay", value="value") # note the backticks here!
print(head(day_delay))
In [ ]:
ggplot(day_delay,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()
Well this is a bit hard to read. What about the first entry for each type of delay in each hour?
We're going to use the distinct
function from tidyR
and first group by time_hour
and type_of_delay
before removing duplicates. We don't want to remove an entire series as would happen if you only grouped by time_hour
on your melted data frame.
In [ ]:
day_delay_first <- distinct(day_delay, time_hour, type_of_delay, .keep_all = TRUE)
# keep all columns
print(head(day_delay_first))
In [ ]:
ggplot(day_delay_first,aes(x=time_hour,y=value,colour=type_of_delay, group=type_of_delay)) + geom_point()
In [ ]:
ind <- which(is.na(flights), arr.ind = TRUE)
print(head(ind))
message('Let\'s make a handy summary using table which gives frequencies of each value:')
ind2 <- table(ind[,2])
print(ind2)
But what do those numbers, 4-15, mean? They're column numbers. We can use colnames
to get the original column names, and names
to access and rename the ind2
vector. (It's not a matrix/data frame so it doesn't have column names.)
In [ ]:
names(ind2) <- colnames(flights)[as.numeric(names(ind2))]
# This is kind of fancy-looking, but we are converting the names of ind2 to numeric because
# they are strings, and using those numeric indices to get the right names from colnames
print(ind2)
Let's just grab our rows with NAs.
In [ ]:
flights_incomplete <- flights[!complete.cases(flights),]
print(dim(flights_incomplete))
Do flights with NA departure time also have an NA departure delay? One way to tell is to check if number of instances where both the departure time and the departure delay are NA is the same as the number of instances where at least one of them is NA.
In [ ]:
print(table(is.na(flights_incomplete$dep_time) & is.na(flights_incomplete$dep_delay)))
print(table(is.na(flights_incomplete$dep_time) | is.na(flights_incomplete$dep_delay)))
Yes. (My sneaking suspicion is that the flights were cancelled.)